This assignment will explore non-personalized recommendations. You will be given a 20x20 matrix where columns represent movies, rows represent users, and each cell represents a user-movie rating.
There are 4 deliverables for this assignment. Each deliverable represents a different analysis of the data provided to you. For each deliverable, you will submit a list of the top 5 movies as ranked by a particular metric. The 4 metrics are: mean rating, number of ratings, percentage of ratings that are 4 or higher, and association with Star Wars Episode IV.
In [114]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline
In [115]:
# Loading the data into a Pandas dataframe
movie_data = pd.read_csv('A1Ratings.csv')
In [116]:
# Looking at the first 5 rows of the dataframe
movie_data.head()
Out[116]:
In [117]:
# Printing the column names of the dataframe
movie_data.columns
Out[117]:
In [118]:
# Summarizing the data in the movie_data dataframe
movie_data.describe()
Out[118]:
In [119]:
# Storing the "1198: Raiders of the Lost Ark (1981)" column as a Pandas Series
raid_lost_arc = movie_data["1198: Raiders of the Lost Ark (1981)"]
raid_lost_arc
Out[119]:
Mean rating for Raiders of the Lost Ark (1981)
In [120]:
print('%.2f' % raid_lost_arc.mean())
Number of non-NA ratings for Raiders of the Lost Ark (1981)
In [121]:
raid_lost_arc.count()
Out[121]:
Percentage of ratings >=4 for Raiders of the Lost Ark (1981)
In [122]:
print('%.1f' % ((len(raid_lost_arc[raid_lost_arc >= 4]) / float(raid_lost_arc.count())) * 100.0))
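The same calculation can be checked by hand on a small Series. The sketch below uses made-up ratings (toy_ratings is hypothetical and not taken from A1Ratings.csv). It relies on two pandas behaviours: a comparison against NaN evaluates to False, so missing ratings are never counted as positive, and count() ignores NaN in the denominator.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings for one movie; NaN marks a user who did not rate it.
toy_ratings = pd.Series([5.0, 3.5, np.nan, 4.0, 2.0, np.nan])

# NaN >= 4 is False, so only real ratings of 4 or more are counted.
n_positive = int((toy_ratings >= 4).sum())

# count() returns the number of non-NA entries.
n_rated = int(toy_ratings.count())

pct_positive = 100.0 * n_positive / n_rated
print('%.1f' % pct_positive)
```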
Finding Association of Raiders of the Lost Ark (1981) with Star Wars Episode IV. The association with Star Wars Episode IV is defined as the number of users that rated BOTH Raiders of the Lost Ark (1981) and Star Wars Episode IV divided by the number of users that rated Star Wars Episode IV.
In [123]:
# First, storing the Star Wars count
star_wars_count = movie_data["260: Star Wars: Episode IV - A New Hope (1977)"].count()
In [124]:
# Then multiply the Raiders of the Lost Ark and Star Wars ratings element-wise.
# The product is NA wherever either rating is NA, so count() returns
# the number of users who rated both movies.
rad_arc_star_wars_count = (movie_data["1198: Raiders of the Lost Ark (1981)"]*movie_data["260: Star Wars: Episode IV - A New Hope (1977)"]).count()
Printing the association of Raiders of the Lost Ark (1981) with Star Wars Episode IV, expressed as a percentage
In [125]:
print('%.1f' % ((rad_arc_star_wars_count / float(star_wars_count)) * 100.0))
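An alternative way to count co-ratings is with explicit boolean masks, which may read more directly than the multiplication trick. The following is a minimal sketch on invented data (toy, movie_a and movie_b are hypothetical names), with movie_b playing the role of Star Wars Episode IV.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings for two movies across five users (NaN = not rated).
toy = pd.DataFrame({
    'movie_a': [4.0, np.nan, 3.0, 5.0, np.nan],
    'movie_b': [5.0, 2.0, np.nan, 4.5, np.nan],
})

rated_a = toy['movie_a'].notna()
rated_b = toy['movie_b'].notna()

# Users who rated both movies, divided by users who rated movie_b.
both = int((rated_a & rated_b).sum())
association = both / float(rated_b.sum())

# The product of the two columns is NaN wherever either rating is NaN,
# so counting its non-NA entries gives the same number.
assert (toy['movie_a'] * toy['movie_b']).count() == both

print('%.1f' % (association * 100.0))
```

Here two users rated both movies and three rated movie_b, so the association is 2/3.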
Making a Pandas Series with the index equal to the movie name and the entry equal to the mean rating for each movie. The column names of the movie_data dataframe are sliced with [1:] since the first column is the user id.
In [126]:
rating_means = pd.Series([movie_data[col_name].mean() for col_name in movie_data.columns[1:]],
                         index=movie_data.columns[1:])
Printing the top 5 rated movies
In [127]:
rating_means.sort_values(ascending=False)[0:5]
Out[127]:
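The per-column loop can also be written with DataFrame-level methods. The sketch below assumes a frame laid out like movie_data (a user-id column followed by one column per movie) but uses invented values; toy and its column names are hypothetical. DataFrame.mean() skips NaN by default, and Series.nlargest(n) returns the n highest entries in descending order.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: a user-id column, then one column per movie.
toy = pd.DataFrame({
    'User': [1, 2, 3],
    'movie_a': [4.0, np.nan, 5.0],
    'movie_b': [2.0, 3.0, np.nan],
    'movie_c': [5.0, 5.0, 4.0],
}, columns=['User', 'movie_a', 'movie_b', 'movie_c'])

# Drop the user-id column by position, then average each movie column.
toy_means = toy.iloc[:, 1:].mean()

# Keep the two movies with the highest mean rating.
top2 = toy_means.nlargest(2)
print(top2)
```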
Making a Pandas Series with the index equal to the movie name and the entry equal to the number of non-NA ratings for each movie. The column names of the movie_data dataframe are sliced with [1:] since the first column is the user id.
In [128]:
rating_count = pd.Series([movie_data[col_name].count() for col_name in movie_data.columns[1:]],
                         index=movie_data.columns[1:])
Printing the top 5 movies with the most ratings
In [129]:
rating_count.sort_values(ascending=False)[0:5]
Out[129]:
Making a Pandas Series with the index equal to the movie name and the entry equal to the fraction of non-NA ratings that are >=4 for each movie. The column names of the movie_data dataframe are sliced with [1:] since the first column is the user id.
In [130]:
rating_positive = pd.Series(
    [sum(movie_data[col_name] >= 4) / float(movie_data[col_name].count())
     for col_name in movie_data.columns[1:]],
    index=movie_data.columns[1:])
Printing the top 5 movies by share of ratings >=4 (shown as a fraction between 0 and 1, not multiplied by 100)
In [131]:
rating_positive.sort_values(ascending=False)[0:5]
Out[131]:
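The share of positive ratings can be computed for every movie at once by dividing two column-wise aggregates. This is a sketch on invented data (toy and its columns are hypothetical), written for Python 3 and assuming the same layout as movie_data, with the user id in the first column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: a user-id column, then one column per movie.
toy = pd.DataFrame({
    'User': [1, 2, 3],
    'movie_a': [4.0, np.nan, 5.0],
    'movie_b': [2.0, 3.0, np.nan],
    'movie_c': [5.0, 3.0, 4.0],
}, columns=['User', 'movie_a', 'movie_b', 'movie_c'])

ratings = toy.iloc[:, 1:]

# Numerator: ratings of 4 or more per movie (NaN compares as False).
# Denominator: non-NA ratings per movie.
share_positive = (ratings >= 4).sum() / ratings.count()

print(share_positive.sort_values(ascending=False))
```

Sorting by the fraction or by the percentage gives the same ranking, since the two differ only by a factor of 100.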
In [132]:
# First, storing the Star Wars ratings and the count of non-NA Star Wars ratings
star_wars_rat = movie_data["260: Star Wars: Episode IV - A New Hope (1977)"]
star_wars_count = float(movie_data["260: Star Wars: Episode IV - A New Hope (1977)"].count())
print(star_wars_count)
Finding the association of all movies with Star Wars Episode IV. The association with Star Wars Episode IV is defined as the number of users that rated BOTH movie i and Star Wars Episode IV divided by the number of users that rated Star Wars Episode IV. Below, we loop over the columns [2:] so that Star Wars Episode IV itself is excluded from the calculation; this relies on Star Wars Episode IV being the first movie column, immediately after the user id.
In [133]:
sim_val = pd.Series([(movie_data[col_name] * star_wars_rat).count() / star_wars_count
                     for col_name in movie_data.columns[2:]],
                    index=movie_data.columns[2:])
Printing the top 5 movies with the highest association with Star Wars (movie id = 260)
In [134]:
sim_val.sort_values(ascending=False)[0:5]
Out[134]:
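The association for all movies can also be computed without relying on column position, by dropping the reference movie by name. The sketch below uses invented data; toy, ref_movie, movie_a and movie_b are hypothetical names, with ref_movie standing in for Star Wars Episode IV.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: a user-id column, a reference movie, and two other movies.
toy = pd.DataFrame({
    'User': [1, 2, 3, 4],
    'ref_movie': [5.0, np.nan, 4.0, 3.0],
    'movie_a': [4.0, 2.0, np.nan, 5.0],
    'movie_b': [np.nan, 1.0, np.nan, 2.0],
}, columns=['User', 'ref_movie', 'movie_a', 'movie_b'])

# True where a user rated the movie, False where the rating is missing.
rated = toy.iloc[:, 1:].notna()
ref_rated = rated['ref_movie']

# Keep only users who rated the reference movie, then count how many of
# them also rated each of the other movies.
both_counts = rated.drop(columns='ref_movie').loc[ref_rated].sum()

# Divide by the number of users who rated the reference movie.
assoc = both_counts / float(ref_rated.sum())

print(assoc.sort_values(ascending=False))
```

Selecting the reference movie by name keeps the result correct even if the column order of the input file changes.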